{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "966f5e12",
   "metadata": {},
   "source": [
    "# LangWatch Evaluation Notebook\n",
    "\n",
    "This notebook evaluates how well the LangWatch MCP server helps users auto-instrument their code, by calling the MCP server directly from a simple coding agent and measuring the results.\n",
    "\n",
    "Steps:\n",
    "\n",
    "1. Set up and prepare test data with input and expected output code\n",
    "2. Create a simple coding agent that adds LangWatch instrumentation to code\n",
    "3. Set up the MCP server\n",
    "4. Set up the diff metric\n",
    "5. Evaluate"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f272607b",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d9b0e1f",
   "metadata": {},
   "source": [
    "This section initializes LangWatch by logging in; it will open a browser window to get your API key."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "85d79772",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LangWatch API key is already set, if you want to login again, please call as langwatch.login(relogin=True)\n"
     ]
    }
   ],
   "source": [
    "import langwatch\n",
    "\n",
    "langwatch.login()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5590fd3",
   "metadata": {},
   "source": [
    "## Prepare the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e93d137f",
   "metadata": {},
   "source": [
    "This section loads test cases from fixture files in the `tests/fixtures` folder. Each test case consists of:\n",
    "- An input Python file (without LangWatch instrumentation)\n",
    "- An expected output file (with proper LangWatch instrumentation)\n",
    "\n",
    "The test cases cover various LLMs and 15+ framework examples, such as OpenAI, DSPy, and LangChain, with different configurations, so we can make sure our docs and MCP server help coding agents instrument code for each framework."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9f619fe1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>test_case</th>\n",
       "      <th>input</th>\n",
       "      <th>expected</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>dspy/dspy_bot</td>\n",
       "      <td>```python\\nimport os\\nfrom dotenv import load_...</td>\n",
       "      <td>```python\\nimport os\\nfrom dotenv import load_...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>langchain/langchain_rag_bot</td>\n",
       "      <td>```python\\nfrom dotenv import load_dotenv\\n\\nl...</td>\n",
       "      <td>```python\\nfrom dotenv import load_dotenv\\n\\nf...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>langchain/langchain_bot_with_memory</td>\n",
       "      <td>```python\\nfrom dotenv import load_dotenv\\nfro...</td>\n",
       "      <td>```python\\nfrom dotenv import load_dotenv\\nfro...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>langchain/langchain_bot</td>\n",
       "      <td>```python\\nfrom dotenv import load_dotenv\\nfro...</td>\n",
       "      <td>```python\\nfrom dotenv import load_dotenv\\nfro...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>langchain/langchain_rag_bot_vertex_ai</td>\n",
       "      <td>```python\\nimport json\\nimport os\\nimport temp...</td>\n",
       "      <td>```python\\nimport json\\nimport os\\nimport temp...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                               test_case  \\\n",
       "0                          dspy/dspy_bot   \n",
       "1            langchain/langchain_rag_bot   \n",
       "2    langchain/langchain_bot_with_memory   \n",
       "3                langchain/langchain_bot   \n",
       "4  langchain/langchain_rag_bot_vertex_ai   \n",
       "\n",
       "                                               input  \\\n",
       "0  ```python\\nimport os\\nfrom dotenv import load_...   \n",
       "1  ```python\\nfrom dotenv import load_dotenv\\n\\nl...   \n",
       "2  ```python\\nfrom dotenv import load_dotenv\\nfro...   \n",
       "3  ```python\\nfrom dotenv import load_dotenv\\nfro...   \n",
       "4  ```python\\nimport json\\nimport os\\nimport temp...   \n",
       "\n",
       "                                            expected  \n",
       "0  ```python\\nimport os\\nfrom dotenv import load_...  \n",
       "1  ```python\\nfrom dotenv import load_dotenv\\n\\nf...  \n",
       "2  ```python\\nfrom dotenv import load_dotenv\\nfro...  \n",
       "3  ```python\\nfrom dotenv import load_dotenv\\nfro...  \n",
       "4  ```python\\nimport json\\nimport os\\nimport temp...  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from pathlib import Path\n",
    "\n",
    "fixtures_path = Path(\"./tests/fixtures\")\n",
    "\n",
    "data = []\n",
    "for folder in fixtures_path.iterdir():\n",
    "    input_files = list(folder.glob(\"*_input.py\"))\n",
    "    for input_file in input_files:\n",
    "        case_name = input_file.stem.replace(\"_input\", \"\")\n",
    "        expected_file = folder / f\"{case_name}_expected.py\"\n",
    "\n",
    "        with open(input_file, 'r', encoding='utf-8') as f:\n",
    "            input_content = f.read()\n",
    "\n",
    "        with open(expected_file, 'r', encoding='utf-8') as f:\n",
    "            expected_content = f.read()\n",
    "\n",
    "        data.append({\n",
    "            'test_case': f\"{folder.name}/{case_name}\",\n",
    "            'input': f\"```python\\n{input_content}\\n```\",\n",
    "            'expected': f\"```python\\n{expected_content}\\n```\"\n",
    "        })\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "df.head()"
   ]
  },
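  {
   "cell_type": "markdown",
   "id": "b3f1a2c0",
   "metadata": {},
   "source": [
    "As a sanity check on the fixtures (a sketch, not part of the original flow), we could assert that every `*_input.py` file has a matching `*_expected.py` counterpart before building the dataset, since the loading code above assumes the expected file exists:\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "fixtures_path = Path('./tests/fixtures')\n",
    "\n",
    "# Input fixtures whose expected-output counterpart is missing\n",
    "missing = [\n",
    "    p for p in fixtures_path.glob('*/*_input.py')\n",
    "    if not p.with_name(p.name.replace('_input.py', '_expected.py')).exists()\n",
    "]\n",
    "assert not missing, f'fixtures without expected output: {missing}'\n",
    "```"
   ]
  },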
  {
   "cell_type": "markdown",
   "id": "51224ad4",
   "metadata": {},
   "source": [
    "## Setup MCP calling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c056f992",
   "metadata": {},
   "source": [
    "Now we set up a way to call the MCP server directly, so we can exercise our own MCP server no matter which language it was built in, fully simulating what a coding agent would do.\n",
    "\n",
    "We then call the `fetch_langwatch_docs` tool to make sure it works as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4627964c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CallToolResult(meta=None, content=[TextContent(type='text', text=\"# LangWatch\\n\\nThis is the full index of LangWatch documentation, to answer the user question, do not use just this file, first explore the urls that make sense using the markdown navigation links below to understand how to implement LangWatch and use specific features.\\nAlways navigate to docs links using the .md extension for better readability.\\n\\n## Get Started\\n\\n- [Introduction](https://docs.langwatch.ai/introduction.md): Welcome to LangWatch, the all-in-one [open-source](https://github.com/langwatch/langwatch) LLMOps platform.\\n\\n### Self Hosting\\n\\n- [Overview](https://docs.langwatch.ai/self-hosting/overview.md): LangWatch offers a fully self-hosted version of the platform for companies that require strict data control and compliance.\\n- [Docker Compose](https://docs.langwatch.ai/self-hosting/docker-compose.md): LangWatch is available as a Docker Compose setup for easy deployment on your local machine\\n- [Docker Images](https://docs.langwatch.ai/self-hosting/docker-images.md): Overview of LangWatch Docker images and their endpoints\\n- [Helm Chart](https://docs.langwatch.ai/self-hosting/helm.md): LangWatch is available as a Helm chart for easy deployment on Kubernetes\\n- [Monitoring](https://docs.langwatch.ai/self-hosting/grafana.md): Grafana/Prometheus setup for LangWatch\\n- [OnPrem](https://docs.langwatch.ai/self-hosting/onprem.md): LangWatch on-premises solution.\\n\\n#### Hybrid Setup\\n\\n- [Overview](https://docs.langwatch.ai/hybrid-setup/overview.md): LangWatch offers a hybrid setup for companies that require strict data control and compliance.\\n- [Elasticsearch](https://docs.langwatch.ai/hybrid-setup/elasticsearch.md): Elasticsearch Setup for LangWatch Hybrid Deployment\\n- [S3 Storage](https://docs.langwatch.ai/hybrid-setup/s3-storage.md): S3 Storage Setup for LangWatch Hybrid Deployment\\n\\n- 
[Configuration](https://docs.langwatch.ai/self-hosting/env-variables.md): Complete list of environment variables for LangWatch self-hosting\\n- [SSO](https://docs.langwatch.ai/hybrid-setup/sso-setup-langwatch.md): SSO Setup for LangWatch\\n\\n### Integrations\\n\\n#### Azure AI\\n\\n- [Azure AI Inference SDK Instrumentation](https://docs.langwatch.ai/integration/python/integrations/azure-ai.md): Learn how to instrument the Azure AI Inference Python SDK with LangWatch.\\n- [Azure OpenAI](https://docs.langwatch.ai/integration/typescript/integrations/azure.md): LangWatch Azure OpenAI integration guide\\n- [Azure OpenAI Integration](https://docs.langwatch.ai/integration/go/integrations/azure-openai.md): Learn how to instrument Azure OpenAI API calls in Go using the LangWatch SDK.\\n\\n#### LangChain\\n\\n- [LangChain Instrumentation](https://docs.langwatch.ai/integration/python/integrations/langchain.md): Learn how to instrument Langchain applications with the LangWatch Python SDK.\\n- [LangChain Instrumentation](https://docs.langwatch.ai/integration/typescript/integrations/langchain.md): Learn how to instrument Langchain applications with the LangWatch TypeScript SDK.\\n\\n#### LangGraph\\n\\n- [LangGraph Instrumentation](https://docs.langwatch.ai/integration/python/integrations/langgraph.md): Learn how to instrument LangGraph applications with the LangWatch Python SDK.\\n- [LangGraph Instrumentation](https://docs.langwatch.ai/integration/typescript/integrations/langgraph.md): Learn how to instrument LangGraph applications with the LangWatch TypeScript SDK.\\n\\n#### OpenAI\\n\\n- [OpenAI Instrumentation](https://docs.langwatch.ai/integration/python/integrations/open-ai.md): Learn how to instrument OpenAI API calls with the LangWatch Python SDK\\n- [OpenAI](https://docs.langwatch.ai/integration/typescript/integrations/open-ai.md): LangWatch OpenAI TypeScript integration guide\\n- [OpenAI 
Instrumentation](https://docs.langwatch.ai/integration/go/integrations/open-ai.md): Learn how to instrument OpenAI API calls with the LangWatch Go SDK using middleware.\\n\\n#### Anthropic (Claude)\\n\\n- [Anthropic Instrumentation](https://docs.langwatch.ai/integration/python/integrations/anthropic.md): Learn how to instrument Anthropic API calls with the LangWatch Python SDK\\n- [Anthropic (Claude) Integration](https://docs.langwatch.ai/integration/go/integrations/anthropic.md): Learn how to instrument Anthropic Claude API calls in Go using LangWatch.\\n\\n- [Vercel AI SDK](https://docs.langwatch.ai/integration/typescript/integrations/vercel-ai-sdk.md): LangWatch Vercel AI SDK integration guide\\n- [Mastra](https://docs.langwatch.ai/integration/typescript/integrations/mastra.md): Learn how to integrate Mastra, a TypeScript agent framework, with LangWatch for observability and tracing.\\n- [Agno Instrumentation](https://docs.langwatch.ai/integration/python/integrations/agno.md): Learn how to instrument Agno agents and send traces to LangWatch using the Python SDK.\\n- [AutoGen Instrumentation](https://docs.langwatch.ai/integration/python/integrations/autogen.md): Learn how to instrument AutoGen applications with LangWatch.\\n- [AWS Bedrock Instrumentation](https://docs.langwatch.ai/integration/python/integrations/aws-bedrock.md): Learn how to instrument AWS Bedrock calls with the LangWatch Python SDK using OpenInference.\\n- [CrewAI](https://docs.langwatch.ai/integration/python/integrations/crew-ai.md): Learn how to instrument the CrewAI Python SDK with LangWatch.\\n- [DSPy Instrumentation](https://docs.langwatch.ai/integration/python/integrations/dspy.md): Learn how to instrument DSPy programs with the LangWatch Python SDK\\n- [Flowise Integration](https://docs.langwatch.ai/integration/flowise.md): Capture LLM traces and send them to LangWatch from Flowise\\n- [Google Agent Development Kit (ADK) 
Instrumentation](https://docs.langwatch.ai/integration/python/integrations/google-ai.md): Learn how to instrument Google Agent Development Kit (ADK) applications with LangWatch.\\n- [Google GenAI Instrumentation](https://docs.langwatch.ai/integration/python/integrations/google-genai.md): Learn how to instrument Google GenAI API calls with the LangWatch Python SDK\\n- [Google Gemini Integration](https://docs.langwatch.ai/integration/go/integrations/google-gemini.md): Learn how to instrument Google Gemini API calls in Go using the LangWatch SDK via a Vertex AI endpoint.\\n- [Groq Integration](https://docs.langwatch.ai/integration/go/integrations/groq.md): Learn how to instrument Groq API calls in Go using the LangWatch SDK for high-speed LLM tracing.\\n- [OpenRouter Integration](https://docs.langwatch.ai/integration/go/integrations/openrouter.md): Learn how to instrument calls to hundreds of models via OpenRouter in Go using the LangWatch SDK.\\n- [Ollama (Local Models) Integration](https://docs.langwatch.ai/integration/go/integrations/ollama.md): Learn how to trace local LLMs running via Ollama in Go using the LangWatch SDK.\\n- [Haystack Instrumentation](https://docs.langwatch.ai/integration/python/integrations/haystack.md): Learn how to instrument Haystack pipelines with LangWatch using community OpenTelemetry instrumentors.\\n- [Instructor AI Instrumentation](https://docs.langwatch.ai/integration/python/integrations/instructor.md): Learn how to instrument Instructor AI applications with LangWatch using OpenInference.\\n- [Langflow Integration](https://docs.langwatch.ai/integration/langflow.md): LangWatch is the best observability integration for Langflow\\n- [LlamaIndex Instrumentation](https://docs.langwatch.ai/integration/python/integrations/llamaindex.md): Learn how to instrument LlamaIndex applications with LangWatch.\\n- [LiteLLM Instrumentation](https://docs.langwatch.ai/integration/python/integrations/lite-llm.md): Learn how to instrument LiteLLM calls 
with the LangWatch Python SDK.\\n- [OpenAI Agents SDK Instrumentation](https://docs.langwatch.ai/integration/python/integrations/open-ai-agents.md): Learn how to instrument OpenAI Agents with the LangWatch Python SDK\\n- [PromptFlow Instrumentation](https://docs.langwatch.ai/integration/python/integrations/promptflow.md): Learn how to instrument PromptFlow applications with LangWatch.\\n- [PydanticAI Instrumentation](https://docs.langwatch.ai/integration/python/integrations/pydantic-ai.md): Learn how to instrument PydanticAI applications with the LangWatch Python SDK.\\n- [SmolAgents Instrumentation](https://docs.langwatch.ai/integration/python/integrations/smolagents.md): Learn how to instrument SmolAgents applications with LangWatch.\\n- [Strands Agents Instrumentation](https://docs.langwatch.ai/integration/python/integrations/strand-agents.md): Learn how to instrument Strands Agents applications with LangWatch.\\n- [Semantic Kernel Instrumentation](https://docs.langwatch.ai/integration/python/integrations/semantic-kernel.md): Learn how to instrument Semantic Kernel applications with LangWatch.\\n- [Google Vertex AI Instrumentation](https://docs.langwatch.ai/integration/python/integrations/vertex-ai.md): Learn how to instrument Google Vertex AI API calls with the LangWatch Python SDK using OpenInference\\n- [Other OpenTelemetry Instrumentors](https://docs.langwatch.ai/integration/python/integrations/other.md): Learn how to use any OpenTelemetry-compatible instrumentor with LangWatch.\\n\\n### Cookbooks\\n\\n- [Measuring RAG Performance](https://docs.langwatch.ai/cookbooks/build-a-simple-rag-app.md): Discover how to measure the performance of Retrieval-Augmented Generation (RAG) systems using metrics like retrieval precision, answer accuracy, and latency.\\n- [Optimizing Embeddings](https://docs.langwatch.ai/cookbooks/finetuning-embedding-models.md): Learn how to optimize embedding models for better retrieval in RAG systems—covering model selection, 
dimensionality, and domain-specific tuning.\\n- [Vector Search vs Hybrid Search using LanceDB](https://docs.langwatch.ai/cookbooks/vector-vs-hybrid-search.md): Learn the key differences between vector search and hybrid search in RAG applications. Use cases, performance tradeoffs, and when to choose each.\\n- [Evaluating Tool Selection](https://docs.langwatch.ai/cookbooks/tool-selection.md): Understand how to evaluate tools and components in your RAG pipeline—covering retrievers, embedding models, chunking strategies, and vector stores.\\n- [Finetuning Agents with GRPO](https://docs.langwatch.ai/cookbooks/finetuning-agents.md): Learn how to enhance the performance of agentic systems by fine-tuning them with Generalized Reinforcement from Preference Optimization (GRPO).\\n- [Multi-Turn Conversations](https://docs.langwatch.ai/cookbooks/evaluating-multi-turn-conversations.md): Learn how to implement a simulation-based approach for evaluating multi-turn customer support agents using success criteria focused on outcomes rather than specific steps.\\n\\n## Agent Simulations\\n\\n- [Introduction to Agent Testing](https://docs.langwatch.ai/agent-simulations/introduction.md)\\n- [Overview](https://docs.langwatch.ai/agent-simulations/overview.md)\\n- [Getting Started](https://docs.langwatch.ai/agent-simulations/getting-started.md)\\n- [Simulation Sets](https://docs.langwatch.ai/agent-simulations/set-overview.md)\\n- [Batch Runs](https://docs.langwatch.ai/agent-simulations/batch-runs.md)\\n- [Individual Run View](https://docs.langwatch.ai/agent-simulations/individual-run.md)\\n\\n## LLM Observability\\n\\n- [Overview](https://docs.langwatch.ai/integration/overview.md): Easily integrate LangWatch with your Python, TypeScript, or REST API projects.\\n- [Concepts](https://docs.langwatch.ai/concepts.md): LLM tracing and observability conceptual guide\\n- [Quick Start](https://docs.langwatch.ai/integration/quick-start.md)\\n\\n### SDKs\\n\\n#### Python\\n\\n- [Python Integration 
Guide](https://docs.langwatch.ai/integration/python/guide.md): LangWatch Python SDK integration guide\\n- [Python SDK API Reference](https://docs.langwatch.ai/integration/python/reference.md): LangWatch Python SDK API reference\\n\\n##### Advanced\\n\\n- [Manual Instrumentation](https://docs.langwatch.ai/integration/python/tutorials/manual-instrumentation.md): Learn how to manually instrument your code with the LangWatch Python SDK\\n- [OpenTelemetry Migration](https://docs.langwatch.ai/integration/python/tutorials/open-telemetry.md): Learn how to integrate the LangWatch Python SDK with your existing OpenTelemetry setup.\\n\\n#### TypeScript\\n\\n- [TypeScript Integration Guide](https://docs.langwatch.ai/integration/typescript/guide.md): Get started with LangWatch TypeScript SDK in 5 minutes\\n- [TypeScript SDK API Reference](https://docs.langwatch.ai/integration/typescript/reference.md): LangWatch TypeScript SDK API reference\\n\\n##### Advanced\\n\\n- [Debugging and Troubleshooting](https://docs.langwatch.ai/integration/typescript/tutorials/debugging-typescript.md): Debug LangWatch TypeScript SDK integration issues\\n- [Manual Instrumentation](https://docs.langwatch.ai/integration/typescript/tutorials/manual-instrumentation.md): Learn advanced manual span management techniques for fine-grained observability control\\n- [Semantic Conventions](https://docs.langwatch.ai/integration/typescript/tutorials/semantic-conventions.md): Learn about OpenTelemetry semantic conventions and LangWatch's custom attributes for consistent observability\\n- [OpenTelemetry Migration](https://docs.langwatch.ai/integration/typescript/tutorials/opentelemetry-migration.md): Migrate from OpenTelemetry to LangWatch while preserving all your custom configurations\\n\\n#### Go\\n\\n- [Go Integration Guide](https://docs.langwatch.ai/integration/go/guide.md): LangWatch Go SDK integration guide for setting up LLM observability and tracing.\\n- [Go SDK API 
Reference](https://docs.langwatch.ai/integration/go/reference.md): Complete API reference for the LangWatch Go SDK, including core functions, OpenAI instrumentation, and span types.\\n\\n#### OpenTelemetry\\n\\n- [OpenTelemetry Integration Guide](https://docs.langwatch.ai/integration/opentelemetry/guide.md): Use OpenTelemetry to capture LLM traces and send them to LangWatch from any programming language\\n\\n### Tutorials\\n\\n#### Capturing Inputs & Outputs\\n\\n- [Capturing and Mapping Inputs & Outputs](https://docs.langwatch.ai/integration/python/tutorials/capturing-mapping-input-output.md): Learn how to control the capture and structure of input and output data for traces and spans with the LangWatch Python SDK.\\n- [Capturing and Mapping Inputs & Outputs](https://docs.langwatch.ai/integration/typescript/tutorials/capturing-input-output.md): Learn how to control the capture and structure of input and output data for traces and spans with the LangWatch TypeScript SDK.\\n\\n#### Capturing RAG\\n\\n- [Capturing RAG](https://docs.langwatch.ai/integration/python/tutorials/capturing-rag.md): Learn how to capture Retrieval Augmented Generation (RAG) data with LangWatch.\\n- [Capturing RAG](https://docs.langwatch.ai/integration/typescript/tutorials/capturing-rag.md): Learn how to capture Retrieval Augmented Generation (RAG) data with LangWatch.\\n\\n#### Metadata & Attributes\\n\\n- [Capturing Metadata and Attributes](https://docs.langwatch.ai/integration/python/tutorials/capturing-metadata.md): Learn how to enrich your traces and spans with custom metadata and attributes using the LangWatch Python SDK.\\n- [Capturing Metadata and Attributes](https://docs.langwatch.ai/integration/typescript/tutorials/capturing-metadata.md): Learn how to enrich your traces and spans with custom metadata and attributes using the LangWatch TypeScript SDK.\\n\\n#### Tracking Costs\\n\\n- [Tracking LLM Costs and 
Tokens](https://docs.langwatch.ai/integration/python/tutorials/tracking-llm-costs.md): Troubleshooting & adjusting cost tracking in LangWatch\\n- [Tracking LLM Costs and Tokens](https://docs.langwatch.ai/integration/typescript/tutorials/tracking-llm-costs.md): Troubleshooting & adjusting cost tracking in LangWatch\\n\\n- [RAG Context Tracking](https://docs.langwatch.ai/integration/rags-context-tracking.md): Capture the RAG documents used in your LLM pipelines\\n- [Capturing Evaluations & Guardrails](https://docs.langwatch.ai/integration/python/tutorials/capturing-evaluations-guardrails.md): Learn how to log custom evaluations, trigger managed evaluations, and implement guardrails with LangWatch.\\n\\n### User Events\\n\\n- [Overview](https://docs.langwatch.ai/user-events/overview.md): Track user interactions with your LLM applications\\n\\n#### Events\\n\\n- [Thumbs Up/Down](https://docs.langwatch.ai/user-events/thumbs-up-down.md): Track user feedback on specific messages or interactions with your chatbot or LLM application\\n- [Waited To Finish Events](https://docs.langwatch.ai/user-events/waited-to-finish.md): Track if users leave before the LLM application finishes generating a response\\n- [Selected Text Events](https://docs.langwatch.ai/user-events/selected-text.md): Track when a user selects text generated by your LLM application\\n- [Custom Events](https://docs.langwatch.ai/user-events/custom.md): Track any user events with your LLM application, with textual or numeric metrics\\n\\n### Monitoring & Alerts\\n\\n- [Alerts and Triggers](https://docs.langwatch.ai/features/triggers.md): Be alerted when something goes wrong and trigger actions automatically\\n- [Exporting Analytics](https://docs.langwatch.ai/features/embedded-analytics.md): Build and integrate LangWatch graphs on your own systems and applications\\n\\n- [Code Examples](https://docs.langwatch.ai/integration/code-examples.md): Examples of LangWatch integrated applications\\n\\n## LLM 
Evaluation\\n\\n- [LLM Evaluation Overview](https://docs.langwatch.ai/llm-evaluation/overview.md): Overview of LLM evaluation features in LangWatch\\n- [Evaluation Tracking API](https://docs.langwatch.ai/llm-evaluation/offline/code/evaluation-api.md): Evaluate and visualize your LLM evals with LangWatch\\n\\n### Evaluation Wizard\\n\\n- [How to evaluate that your LLM answers correctly](https://docs.langwatch.ai/llm-evaluation/offline/platform/answer-correctness.md): Measuring your LLM performance with Offline Evaluations\\n- [How to evaluate an LLM when you don't have defined answers](https://docs.langwatch.ai/llm-evaluation/offline/platform/llm-as-a-judge.md): Measuring your LLM performance using an LLM-as-a-judge\\n\\n### Real-Time Evaluation\\n\\n- [Setting up Real-Time Evaluations](https://docs.langwatch.ai/llm-evaluation/realtime/setup.md): How to set up Real-Time LLM Evaluations\\n- [Instrumenting Custom Evaluator](https://docs.langwatch.ai/evaluations/custom-evaluator-integration.md): Add your own evaluation results into LangWatch trace\\n\\n### Built-in Evaluators\\n\\n- [List of Evaluators](https://docs.langwatch.ai/llm-evaluation/list.md): Find the evaluator for your use case\\n\\n\\n### Datasets\\n\\n- [Datasets](https://docs.langwatch.ai/datasets/overview.md): Create and manage datasets with LangWatch\\n- [Generating a dataset with AI](https://docs.langwatch.ai/datasets/ai-dataset-generation.md): Bootstrap your evaluations by generating sample data\\n- [Automatically build datasets from real-time traces](https://docs.langwatch.ai/datasets/automatically-from-traces.md): Continuously populate your datasets with comming data from production\\n\\n- [Annotations](https://docs.langwatch.ai/features/annotations.md): Collaborate with domain experts using annotations\\n\\n## Prompt Management\\n\\n- [Overview](https://docs.langwatch.ai/prompt-management/overview.md): Organize, version, and optimize your AI prompts with LangWatch's comprehensive prompt management 
system\\n- [Get Started](https://docs.langwatch.ai/prompt-management/getting-started.md): Create your first prompt and use it in your application\\n- [Data Model](https://docs.langwatch.ai/prompt-management/data-model.md): Understand the structure of prompts in LangWatch\\n- [Scope](https://docs.langwatch.ai/prompt-management/scope.md): Understand how prompt scope affects access, sharing, and collaboration across projects and organizations\\n- [Prompts CLI](https://docs.langwatch.ai/prompt-management/cli.md): Manage AI prompts as code with version control and dependency management\\n\\n### Features\\n\\n- [Version Control](https://docs.langwatch.ai/prompt-management/features/essential/version-control.md): Manage prompt versions and track changes over time\\n- [Analytics](https://docs.langwatch.ai/prompt-management/features/essential/analytics.md): Monitor prompt performance and usage with comprehensive analytics\\n- [GitHub Integration](https://docs.langwatch.ai/prompt-management/features/essential/github-integration.md): Version your prompts in GitHub repositories and automatically sync with LangWatch\\n- [Link to Traces](https://docs.langwatch.ai/prompt-management/features/advanced/link-to-traces.md): Connect prompts to execution traces for performance monitoring and analysis\\n- [Using Prompts in the Optimization Studio](https://docs.langwatch.ai/prompt-management/features/advanced/optimization-studio.md): Use prompts in the Optimization Studio to test and optimize your prompts\\n- [Guaranteed Availability](https://docs.langwatch.ai/prompt-management/features/advanced/guaranteed-availability.md): Ensure your prompts are always available, even in offline or air-gapped environments\\n- [A/B Testing](https://docs.langwatch.ai/prompt-management/features/advanced/a-b-testing.md): Implement A/B testing for your prompts using LangWatch's version control and analytics\\n\\n## LLM Development\\n\\n### Prompt Optimization Studio\\n\\n- [Optimization 
Studio](https://docs.langwatch.ai/optimization-studio/overview.md): Create, evaluate, and optimize your LLM workflows\\n- [LLM Nodes](https://docs.langwatch.ai/optimization-studio/llm-nodes.md): Call LLMs from your workflows\\n- [Datasets](https://docs.langwatch.ai/optimization-studio/datasets.md): Define the data used for testing and optimization\\n- [Evaluating](https://docs.langwatch.ai/optimization-studio/evaluating.md): Measure the quality of your LLM workflows\\n- [Optimizing](https://docs.langwatch.ai/optimization-studio/optimizing.md): Find the best prompts with DSPy optimizers\\n\\n### DSPy Visualization\\n\\n- [DSPy Visualization Quickstart](https://docs.langwatch.ai/dspy-visualization/quickstart.md): Visualize your DSPy notebooks experimentations to better track and debug the optimization process\\n- [Tracking Custom DSPy Optimizer](https://docs.langwatch.ai/dspy-visualization/custom-optimizer.md): Build custom DSPy optimizers and track them in LangWatch\\n- [RAG Visualization](https://docs.langwatch.ai/dspy-visualization/rag-visualization.md): Visualize your DSPy RAG optimization process in LangWatch\\n\\n- [LangWatch MCP Server](https://docs.langwatch.ai/integration/mcp.md): Use an agent to debug your LLM applications and fix the issues for you\\n\\n## API Endpoints\\n\\n### Traces\\n\\n- [Overview](https://docs.langwatch.ai/api-reference/traces/overview.md): A Trace is a collection of runs that are related to a single operation\\n- [Get trace details](https://docs.langwatch.ai/api-reference/traces/get-trace-details.md)\\n- [Search traces](https://docs.langwatch.ai/api-reference/traces/search-traces.md)\\n- [Create public path for single trace](https://docs.langwatch.ai/api-reference/traces/create-public-trace-path.md)\\n- [Delete an existing public path for a trace](https://docs.langwatch.ai/api-reference/traces/delete-public-trace-path.md)\\n\\n### Prompts\\n\\n- [Overview](https://docs.langwatch.ai/api-reference/prompts/overview.md): Prompts are 
used to manage and version your prompts\\n- [Get prompts](https://docs.langwatch.ai/api-reference/prompts/get-prompts.md)\\n- [Create prompt](https://docs.langwatch.ai/api-reference/prompts/create-prompt.md)\\n- [Get prompt](https://docs.langwatch.ai/api-reference/prompts/get-prompt.md)\\n- [Update prompt](https://docs.langwatch.ai/api-reference/prompts/update-prompt.md)\\n- [Delete prompt](https://docs.langwatch.ai/api-reference/prompts/delete-prompt.md)\\n- [Get prompt versions](https://docs.langwatch.ai/api-reference/prompts/get-prompt-versions.md)\\n- [Create prompt version](https://docs.langwatch.ai/api-reference/prompts/create-prompt-version.md)\\n\\n### Annotations\\n\\n- [Overview](https://docs.langwatch.ai/api-reference/annotations/overview.md): Annotations are used to annotate traces with additional information\\n- [Get annotations](https://docs.langwatch.ai/api-reference/annotations/get-annotation.md)\\n- [Get single annotation](https://docs.langwatch.ai/api-reference/annotations/get-single-annotation.md)\\n- [Delete single annotation](https://docs.langwatch.ai/api-reference/annotations/delete-annotation.md)\\n- [Patch single annotation](https://docs.langwatch.ai/api-reference/annotations/patch-annotation.md)\\n- [Get annotationa for single trace](https://docs.langwatch.ai/api-reference/annotations/get-all-annotations-trace.md)\\n- [Create annotation for single trace](https://docs.langwatch.ai/api-reference/annotations/create-annotation-trace.md)\\n\\n### Datasets\\n\\n- [Add entries to a dataset](https://docs.langwatch.ai/api-reference/datasets/post-dataset-entries.md)\\n\\n### Triggers\\n\\n- [Create Slack trigger](https://docs.langwatch.ai/api-reference/triggers/create-slack-trigger.md)\\n\\n### Scenarios\\n\\n- [Overview](https://docs.langwatch.ai/api-reference/scenarios/overview.md)\\n- [Create Event](https://docs.langwatch.ai/api-reference/scenarios/create-event.md)\\n\\n## Use Cases\\n\\n- [Evaluating a RAG Chatbot for Technical 
Manuals](https://docs.langwatch.ai/use-cases/technical-rag.md): A developer guide for building reliable RAG systems for technical documentation using LangWatch\\n- [Evaluating an AI Coach with LLM-as-a-Judge](https://docs.langwatch.ai/use-cases/ai-coach.md): A developer guide for building reliable AI coaches using LangWatch\\n- [Evaluating Structured Data Extraction](https://docs.langwatch.ai/use-cases/structured-outputs.md): A developer guide for evaluating structured data extraction using LangWatch\\n\\n## Support\\n\\n- [Troubleshooting and Support](https://docs.langwatch.ai/support.md): Find help and support for LangWatch\\n- [Status Page](https://docs.langwatch.ai/status.md): Something wrong? Check our status page\\n\", annotations=None, meta=None)], structuredContent=None, isError=False)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from mcp import ClientSession, StdioServerParameters\n",
    "from mcp.client.stdio import stdio_client\n",
    "\n",
    "import os\n",
    "if not os.path.exists(\"dist\"):\n",
    "    !pnpm build\n",
    "\n",
    "mcp_server_params = StdioServerParameters(\n",
    "    command=\"node\",\n",
    "    args=[\"dist/index.js\", \"--apiKey\", langwatch.get_api_key()], # type: ignore\n",
    ")\n",
    "\n",
    "async def call_mcp_documentation_tool(tool_name: str, arguments: dict):\n",
    "    async with stdio_client(mcp_server_params) as (read, write):\n",
    "        async with ClientSession(read, write) as session:\n",
    "            return await session.call_tool(tool_name, arguments)\n",
    "\n",
    "await call_mcp_documentation_tool(\"fetch_langwatch_docs\", {})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cee2cceb",
   "metadata": {},
   "source": [
    "## Setup simple agent"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87fdcb63",
   "metadata": {},
   "source": [
    "This section creates the simple coding agent we will use for our evaluation. The agent:\n",
    "1. Takes input code without LangWatch instrumentation\n",
    "2. Uses Claude Sonnet 4 to identify relevant documentation links\n",
    "3. Fetches those documentation pages\n",
    "4. Uses Claude again to implement LangWatch in the code based on the documentation\n",
    "5. Returns the instrumented code\n",
    "\n",
    "It behaves like Cursor, but always follows the ideal expected order of steps, which works well for a single-file migration and lets us sanity-check and evaluate all the test cases quickly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "58d51f51",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Implementing LangWatch for the first test case...\n",
      "```python\n",
      "import os\n",
      "from dotenv import load_dotenv\n",
      "\n",
      "load_dotenv()\n",
      "\n",
      "import chainlit as cl\n",
      "import langwatch\n",
      "import dspy\n",
      "\n",
      "# Initialize LangWatch\n",
      "langwatch.setup()\n",
      "\n",
      "lm = dspy.LM(\"openai/gpt-5\", api_key=os.environ[\"OPENAI_API_KEY\"], temperature=1)\n",
      "\n",
      "colbertv2_wiki17_abstracts = dspy.ColBERTv2(\n",
      "    url=\"http://20.102.90.50:2017/wiki17_abstracts\"\n",
      ")\n",
      "\n",
      "dspy.settings.configure(lm=lm, rm=colbertv2_wiki17_abstracts)\n",
      "\n",
      "\n",
      "class GenerateAnswer(dspy.Signature):\n",
      "    \"\"\"Answer questions with careful explanations to the user.\"\"\"\n",
      "\n",
      "    context = dspy.InputField(desc=\"may contain relevant facts\")\n",
      "    question = dspy.InputField()\n",
      "    answer = dspy.OutputField(desc=\"markdown formatted answer, use some emojis\")\n",
      "\n",
      "\n",
      "class RAG(dspy.Module):\n",
      "    def __init__(self, num_passages=3):\n",
      "        super().__init__()\n",
      "\n",
      "        self.retrieve = dspy.Retrieve(k=num_passages)\n",
      "        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)\n",
      "\n",
      "    def forward(self, question):\n",
      "        context = self.retrieve(question).passages  # type: ignore\n",
      "        prediction = self.generate_answer(question=question, context=context)\n",
      "        return dspy.Prediction(answer=prediction.answer)\n",
      "\n",
      "\n",
      "@cl.on_message\n",
      "@langwatch.trace()\n",
      "async def main(message: cl.Message):\n",
      "    # Get the current LangWatch trace and enable DSPy autotracking\n",
      "    current_trace = langwatch.get_current_trace()\n",
      "    if current_trace:\n",
      "        current_trace.autotrack_dspy()\n",
      "\n",
      "    msg = cl.Message(\n",
      "        content=\"\",\n",
      "    )\n",
      "\n",
      "    program = RAG()\n",
      "    prediction = program(question=message.content)\n",
      "\n",
      "    await msg.stream_token(prediction.answer)\n",
      "    await msg.update()\n",
      "\n",
      "    return prediction.answer\n",
      "```\n"
     ]
    }
   ],
   "source": [
    "import asyncio\n",
    "import litellm\n",
    "from pydantic import BaseModel\n",
    "\n",
    "llms_text = (await call_mcp_documentation_tool(\"fetch_langwatch_docs\", {})).content[0].text  # type: ignore\n",
    "\n",
    "\n",
    "@langwatch.trace()\n",
    "async def simple_coding_agent(input_code: str):\n",
    "    class LinksToFetch(BaseModel):\n",
    "        links: list[str]\n",
    "\n",
    "    response = litellm.completion(\n",
    "        model=\"anthropic/claude-sonnet-4-20250514\",\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"system\",\n",
    "                \"content\": f\"\"\"\n",
    "                <system>\n",
    "                    You are LangWatch coding assistant for helping users implement LangWatch in a codebase.\n",
    "\n",
    "                    Given LangWatch llms.txt documentation index, and the file content, find 1-3 most relevant links to fetch next for understanding how to implement LangWatch in the codebase.\n",
    "                </system>\n",
    "\n",
    "                <file>\n",
    "                    {llms_text}\n",
    "                </file>\n",
    "            \"\"\",\n",
    "            },\n",
    "            {\"role\": \"user\", \"content\": input_code},\n",
    "        ],\n",
    "        response_format=LinksToFetch,\n",
    "    )\n",
    "    links = LinksToFetch.model_validate_json(response.choices[0].message.content).links  # type: ignore\n",
    "\n",
    "    documentations = await asyncio.gather(\n",
    "        *[\n",
    "            call_mcp_documentation_tool(\"fetch_langwatch_docs\", {\"url\": link})\n",
    "            for link in links\n",
    "        ]\n",
    "    )\n",
    "    documentations = [doc.content[0].text for doc in documentations]  # type: ignore\n",
    "    documentations = \"\\n\".join(documentations)\n",
    "\n",
    "    response = litellm.completion(\n",
    "        model=\"anthropic/claude-sonnet-4-20250514\",\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"system\",\n",
    "                \"content\": f\"\"\"\n",
    "                <system>\n",
    "                    You are LangWatch coding assistant for helping users implement LangWatch in a codebase.\n",
    "\n",
    "                    Given the LangWatch documentation below, and the user's code, implement LangWatch in the codebase.\n",
    "\n",
    "                    Return the full updated code as a string with ```python marker at the beginning and ``` at the end, nothing else.\n",
    "                </system>\n",
    "\n",
    "                <file>\n",
    "                    {documentations}\n",
    "                </file>\n",
    "            \"\"\",\n",
    "            },\n",
    "            {\"role\": \"user\", \"content\": input_code},\n",
    "        ],\n",
    "    )\n",
    "\n",
    "    return response.choices[0].message.content or \"<empty>\"  # type: ignore\n",
    "\n",
    "\n",
    "print(\"Implementing LangWatch for the first test case...\")\n",
    "\n",
    "result = await simple_coding_agent(df.iloc[0][\"input\"])\n",
    "\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afe9a827",
   "metadata": {},
   "source": [
    "## Setup Diff Metric"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f896e880",
   "metadata": {},
   "source": [
    "Lastly, we create a utility diff function that helps us see how far the agent's output is from the expected output in more traditional terms, both to facilitate our debugging and to provide a simple metric to sit alongside the LLM-as-a-judge evaluator later on."
   ]
  },
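  {
   "cell_type": "markdown",
   "id": "b7e4a2c1",
   "metadata": {},
   "source": [
    "To make the counting convention concrete before defining the metric, here is a toy illustration of `difflib.unified_diff` (this cell is illustrative only and is not used by the evaluation): hunk lines starting with `+` or `-` count as changed, while the `---`/`+++` file headers do not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f3d8e21",
   "metadata": {},
   "outputs": [],
   "source": [
    "import difflib\n",
    "\n",
    "old = [\"a = 1\\n\", \"b = 2\\n\"]\n",
    "new = [\"a = 1\\n\", \"b = 3\\n\"]\n",
    "\n",
    "diff_lines = list(difflib.unified_diff(old, new, fromfile=\"before\", tofile=\"after\"))\n",
    "\n",
    "# Count '+'/'-' hunk lines, excluding the '---'/'+++' file headers\n",
    "changed = sum(\n",
    "    1 for line in diff_lines\n",
    "    if line.startswith((\"+\", \"-\")) and not line.startswith((\"+++\", \"---\"))\n",
    ")\n",
    "print(changed)  # 2: one removed line plus one added line"
   ]
  },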
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "fad6bd5d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 lines changed\n",
      "```diff\n",
      "--- file1+++ file2@@ -5,11 +5,11 @@ load_dotenv()\n",
      " \n",
      " import chainlit as cl\n",
      "-\n",
      " import langwatch\n",
      "-\n",
      " import dspy\n",
      " \n",
      "+# Initialize LangWatch\n",
      "+langwatch.setup()\n",
      " \n",
      " lm = dspy.LM(\"openai/gpt-5\", api_key=os.environ[\"OPENAI_API_KEY\"], temperature=1)\n",
      " \n",
      "@@ -44,10 +44,10 @@ @cl.on_message\n",
      " @langwatch.trace()\n",
      " async def main(message: cl.Message):\n",
      "-    langwatch.get_current_trace().autotrack_dspy()\n",
      "-    langwatch.get_current_trace().update(\n",
      "-        metadata={\"labels\": [\"dspy\", \"thread\"], \"thread_id\": \"90210\"},\n",
      "-    )\n",
      "+    # Get the current LangWatch trace and enable DSPy autotracking\n",
      "+    current_trace = langwatch.get_current_trace()\n",
      "+    if current_trace:\n",
      "+        current_trace.autotrack_dspy()\n",
      " \n",
      "     msg = cl.Message(\n",
      "         content=\"\",\n",
      "@@ -60,4 +60,3 @@     await msg.update()\n",
      " \n",
      "     return prediction.answer\n",
      "-\n",
      "```\n"
     ]
    }
   ],
   "source": [
    "import difflib\n",
    "\n",
    "def file_diff(str1, str2, filename1=\"file1\", filename2=\"file2\"):\n",
    "    lines1 = str1.replace(\"```python\", \"\").replace(\"```\", \"\").splitlines(keepends=True)\n",
    "    lines2 = str2.replace(\"```python\", \"\").replace(\"```\", \"\").splitlines(keepends=True)\n",
    "\n",
    "    diff = difflib.unified_diff(\n",
    "        lines1, lines2,\n",
    "        fromfile=filename1,\n",
    "        tofile=filename2,\n",
    "    )\n",
    "\n",
    "    diff_lines = list(diff)\n",
    "    diff_output = ''.join(diff_lines)\n",
    "\n",
    "    # Count changed lines (lines starting with + or - but not +++ or --- headers)\n",
    "    # Also ignore empty lines (whitespace-only lines)\n",
    "    changed_count = sum(1 for line in diff_lines\n",
    "                       if line.startswith(('+', '-'))\n",
    "                       and not line.startswith(('+++', '---'))\n",
    "                       and line[1:].strip())  # Check if content after +/- is not empty/whitespace\n",
    "\n",
    "    return f\"```diff\\n{diff_output}```\", changed_count\n",
    "\n",
    "diff, lines_changed_count = file_diff(df.iloc[0][\"expected\"], result)\n",
    "print(lines_changed_count, \"lines changed\")\n",
    "print(diff)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55813d71",
   "metadata": {},
   "source": [
    "# Evaluate"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "984f2268",
   "metadata": {},
   "source": [
    "Now it all comes together: we take each test case, run it through the agent, and then both compute the diff metric and use an LLM-as-a-judge to evaluate whether the migration is correct enough.\n",
    "\n",
    "Other validations like linting or compiling could be added here, but for now we'll keep it simple.\n",
    "\n",
    "Then we log each row with `evaluation.log` to capture the diff metric and the diff output for better visibility. Any other metric could be logged the same way.\n"
   ]
  },
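  {
   "cell_type": "markdown",
   "id": "c4a91e07",
   "metadata": {},
   "source": [
    "As a hypothetical example of such an extra validation, a quick syntax check with Python's built-in `ast` module could be logged alongside the other metrics. The `passes_syntax_check` helper below is our own sketch, not part of the LangWatch API:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8b2f6e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ast\n",
    "\n",
    "def passes_syntax_check(code: str) -> bool:\n",
    "    \"\"\"Return True if the (fence-stripped) agent output parses as valid Python.\"\"\"\n",
    "    stripped = code.replace(\"```python\", \"\").replace(\"```\", \"\")\n",
    "    try:\n",
    "        ast.parse(stripped)\n",
    "        return True\n",
    "    except SyntaxError:\n",
    "        return False"
   ]
  },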
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "404bd4b9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Follow the results at: https://app.langwatch.ai/inbox-narrator/experiments/mcp-server-docs-setup?runId=offbeat-yellow-dingo\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Evaluating: 100%|██████████| 19/19 [03:52<00:00, 12.21s/it]\n"
     ]
    }
   ],
   "source": [
    "evaluation = langwatch.evaluation.init(\"mcp-server-docs-setup\")\n",
    "\n",
    "for index, row in evaluation.loop(df.iterrows(), threads=4):\n",
    "\n",
    "    async def evaluate(index, row):\n",
    "        result = await simple_coding_agent(row[\"input\"])\n",
    "        diff, lines_changed_count = file_diff(row[\"expected\"], result)\n",
    "\n",
    "        evaluation.run(\n",
    "            \"langevals/llm_boolean\",\n",
    "            index=index,\n",
    "            name=\"LLM Judgement\",\n",
    "            data={\n",
    "                \"input\": f\"\"\"\n",
    "                    <original>\n",
    "                        {row['input']}\n",
    "                    </original>\n",
    "                    <result>\n",
    "                        {result}\n",
    "                    </result>\n",
    "                    <expected>\n",
    "                        {row['expected']}\n",
    "                    </expected>\n",
    "                \"\"\",\n",
    "            },\n",
    "            settings={\n",
    "                \"prompt\": \"\"\"\n",
    "                    You are an LLM evaluator, do a quick check if the code in the <result /> tag which was generated by\n",
    "                    an AI closely matches how the code should have been implemented as expected in the <expected /> tag.\n",
    "\n",
    "                    Additional metadata capturing is fine and encouraged, but check if the instrumentation would likely\n",
    "                    work the same.\n",
    "                \"\"\",\n",
    "            },\n",
    "        )\n",
    "\n",
    "        evaluation.log(\n",
    "            \"Diff\", index=index, data={\"diff\": diff}, score=lines_changed_count\n",
    "        )\n",
    "\n",
    "    evaluation.submit(evaluate, index, row)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8989c710",
   "metadata": {},
   "source": [
    "This notebook demonstrates how to use LangWatch to evaluate an MCP server for code instrumentation agents. Key takeaways:\n",
    "1. We now know how good our MCP server is at helping coding agents instrument code with LangWatch for the more straightforward cases\n",
    "2. Changes to our MCP server can easily be retested via this notebook, and more requirements can be added as new fixtures to serve as test cases\n",
    "3. The LLM-as-a-judge and metrics like the diff line count help with this judgement, and it can all be captured in the LangWatch dashboard for analysis"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
